Added nerc_invoicing repo with functions for querying invoicing Iceberg table#267
Added nerc_invoicing repo with functions for querying invoicing Iceberg table#267jimmysway wants to merge 2 commits into
Conversation
| BALANCE_COL = "Balance" | ||
|
|
||
|
|
||
| def get_invoice_dataframe() -> pd.DataFrame: |
There was a problem hiding this comment.
You should cache this using functools.cache
8b3abd3 to
73b5ca0
Compare
knikolla
left a comment
There was a problem hiding this comment.
Previous comments still unresolved.
73b5ca0 to
42f08b4
Compare
knikolla
left a comment
There was a problem hiding this comment.
Comments about using the config system and caching still applicable.
Sorry I missed those I am addressing those now. I think caching the I will instead apply it to the |
42f08b4 to
a2fbd6f
Compare
Why do you think |
I applied caching to _get_table() instead because it returns a reusable StaticTable handle (metadata + query interface), not full row data. Caching _get_table() mainly reduces repeated table-initialization overhead on every call. If we were instead to cache the dataframe itself with I guess the question is what is the purpose of this library? Are we going to use as a thin wrapper for downstream processes to run predictable queries once a month, or are we going to be using this as a primary way to interface with Iceberg where it will be used as other processes or data tools to interface and run analytics on the data itself. Currently, I think the library is going to be used as a thin wrapper. Since this module is just a thin wrapper, it should do the simple safe optimization (_get_table) and leave heavier caching to whatever service or job is using it. |
a2fbd6f to
6239016
Compare
6239016 to
4920ec7
Compare
| return grouped_df | ||
|
|
||
|
|
||
| def aggregate_by( |
There was a problem hiding this comment.
I've split the previous select_and_group into two functions aggregate_by and group_and_sum. I think the old function was doing too much and was misleading. This way group_and_sum alone is useful whenever you already have a DataFrame and just need deterministic aggregation.
QuanMPhm
left a comment
There was a problem hiding this comment.
Some comments I have for now. Will take a closer look and run some tests
31f1732 to
7ef0bf9
Compare
dee51ba to
89d21b0
Compare
| group_by: tuple[str, ...], | ||
| *, | ||
| agg_col: str, | ||
| agg_name: str = "total", |
There was a problem hiding this comment.
@knikolla Is there a strong need for the output column to have a custom name? This function already only outputs the grouped and output columns, so without agg_name, there would not be any ambiguity.
a_summed = a.groupby([INVOICE_DATE_FIELD, PROJECT_FIELD], dropna=False, as_index=False)["Balance"].agg("sum")
print(a_summed.head())
Invoice Month Project - Allocation Balance
0 2024-03 False 6.43
1 2024-03 True 403.30
2 2024-03 NaN 3.30
3 2024-04 False 1000.00
4 2024-04 True 510.00I just don't want unneeded features for now.
| expression = costs._row_filter(**filters) | ||
| assert isinstance(expression, costs.And) | ||
| assert isinstance(expression.left, costs.EqualTo) | ||
| assert isinstance(expression.right, costs.EqualTo) |
There was a problem hiding this comment.
I would suggest a stricter check, like an equality:
| expression = costs._row_filter(**filters) | |
| assert isinstance(expression, costs.And) | |
| assert isinstance(expression.left, costs.EqualTo) | |
| assert isinstance(expression.right, costs.EqualTo) | |
| expression = costs._row_filter(**filters) | |
| assert isinstance(expression, costs.And) | |
| assert expression.left == costs.EqualTo(PID, "vllm-test") | |
| ... |
There was a problem hiding this comment.
The test checks that the right type of expression was built, which I think is sufficient. Checking exact values is not really useful imho because if _row_filter garbles the values, group_and_sum and aggregate_by tests would catch it anyway since they operate end-to-end with real data.
There was a problem hiding this comment.
The test checks that the right type of expression was built, which I think is sufficient. Checking exact values is not really useful
I disagree. This unit test is the only one letting us understand the expected behavior of _row_filter. We should know exactly the expected output should be, and in this is case it is not difficult to do so.
because if _row_filter garbles the values, group_and_sum and aggregate_by tests would catch it anyway since they operate end-to-end with real data.
get_invoice_dataframe, the only function that calls _row_filter, is heavily mocked in your test cases. None of your test cases deliberately test the filtering feature because we lack an integration test. Therefore, I'll argue none of the tests you presented adequately test _row_filter
| def test_aggregate_by_rounds_and_forwards_filters( | ||
| monkeypatch: pytest.MonkeyPatch, sample_invoice_dataframe: pd.DataFrame | ||
| ): | ||
| captured: dict[str, object] = {} | ||
|
|
||
| def _fake_loader(cols=None, **filters): | ||
| captured["cols"] = cols | ||
| captured["filters"] = filters | ||
| return sample_invoice_dataframe | ||
|
|
||
| monkeypatch.setattr(costs, "get_invoice_dataframe", _fake_loader) |
There was a problem hiding this comment.
Have you considered unittest.mock, or pytest-mock to mock get_invoice_dataframe? Mocking will provide you with helper functions to capture how get_invoice_dataframe was called, removing the need for _fake_loader. There's a few examples of this mocking in this repo
This suggest would also help remove some mocking code you have further in your test cases, such as for class _FakeIcebergTable
There was a problem hiding this comment.
I think we should keep the _FakeIcebergTable. I believe having code that is understandable is better.
_FakeIcebergTable explicitly documents the .scan().select().to_pandas() chain the code depends on. Which is very useful to understand the context which is something that a MagicMock will abstract away.
There was a problem hiding this comment.
The purpose of the test case is to see if get_invoice_dataframe emits a warning if the filtered dataframe is empty. It is explicitly not concerned with the internals of the iceberg table, which is why this is mocked out in the first place. A developer would review a test case to understand the expected behavior, not the internals of how that is done.
Looking at it a bit more, I would argue asserting that the output equals the input is moot here, since you setup the mock to return the same dataframe. If we wanted to test the pyiceberg code, we would need an integration test, which can be a part of this PR or a seperate one. @knikolla Thoughts?
89d21b0 to
97a1923
Compare
|
@jimmysway Some CI test cases are broken |
9a1a7bc to
c96e05a
Compare
Added functionality to get lifetime costs grouped by project Added pyproject.toml for future publishing Added testing for new functionality
…o alert based on the "Cost" column I have made the changes to the code as necessary, switching from summing "Balance" to "Cost" instead Added equality check for row_filter test
c96e05a to
ff60637
Compare
Added nerc_invoicing repo for querying invoicing Iceberg table
Added functionality to get lifetime costs grouped by project
Added pyproject.toml for future publishing
Closes #266